{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tree Methods Code Along\n",
    "\n",
    "In this lecture we will code along with some data and test out 3 different tree methods:\n",
    "\n",
    "* A single decision tree\n",
    "* A random forest\n",
    "* A gradient boosted tree classifier\n",
    "    \n",
    "We will be using a college dataset to try to classify colleges as Private or Public based off these features:\n",
    "\n",
    "    Private A factor with levels No and Yes indicating private or public university\n",
    "    Apps Number of applications received\n",
    "    Accept Number of applications accepted\n",
    "    Enroll Number of new students enrolled\n",
    "    Top10perc Pct. new students from top 10% of H.S. class\n",
    "    Top25perc Pct. new students from top 25% of H.S. class\n",
    "    F.Undergrad Number of fulltime undergraduates\n",
    "    P.Undergrad Number of parttime undergraduates\n",
    "    Outstate Out-of-state tuition\n",
    "    Room.Board Room and board costs\n",
    "    Books Estimated book costs\n",
    "    Personal Estimated personal spending\n",
    "    PhD Pct. of faculty with Ph.D.’s\n",
    "    Terminal Pct. of faculty with terminal degree\n",
    "    S.F.Ratio Student/faculty ratio\n",
    "    perc.alumni Pct. alumni who donate\n",
    "    Expend Instructional expenditure per student\n",
    "    Grad.Rate Graduation rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#Tree methods Example\n",
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession.builder.appName('treecode').getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Load training data\n",
    "data = spark.read.csv('College.csv',inferSchema=True,header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- School: string (nullable = true)\n",
      " |-- Private: string (nullable = true)\n",
      " |-- Apps: integer (nullable = true)\n",
      " |-- Accept: integer (nullable = true)\n",
      " |-- Enroll: integer (nullable = true)\n",
      " |-- Top10perc: integer (nullable = true)\n",
      " |-- Top25perc: integer (nullable = true)\n",
      " |-- F_Undergrad: integer (nullable = true)\n",
      " |-- P_Undergrad: integer (nullable = true)\n",
      " |-- Outstate: integer (nullable = true)\n",
      " |-- Room_Board: integer (nullable = true)\n",
      " |-- Books: integer (nullable = true)\n",
      " |-- Personal: integer (nullable = true)\n",
      " |-- PhD: integer (nullable = true)\n",
      " |-- Terminal: integer (nullable = true)\n",
      " |-- S_F_Ratio: double (nullable = true)\n",
      " |-- perc_alumni: integer (nullable = true)\n",
      " |-- Expend: integer (nullable = true)\n",
      " |-- Grad_Rate: integer (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Spark Formatting of Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# A few things we need to do before Spark can accept the data!\n",
    "# It needs to be in the form of two columns\n",
    "# (\"label\",\"features\")\n",
    "\n",
    "# Import VectorAssembler and Vectors\n",
    "from pyspark.ml.linalg import Vectors\n",
    "from pyspark.ml.feature import VectorAssembler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['School',\n",
       " 'Private',\n",
       " 'Apps',\n",
       " 'Accept',\n",
       " 'Enroll',\n",
       " 'Top10perc',\n",
       " 'Top25perc',\n",
       " 'F_Undergrad',\n",
       " 'P_Undergrad',\n",
       " 'Outstate',\n",
       " 'Room_Board',\n",
       " 'Books',\n",
       " 'Personal',\n",
       " 'PhD',\n",
       " 'Terminal',\n",
       " 'S_F_Ratio',\n",
       " 'perc_alumni',\n",
       " 'Expend',\n",
       " 'Grad_Rate']"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "assembler = VectorAssembler(\n",
    "  inputCols=['Apps',\n",
    "             'Accept',\n",
    "             'Enroll',\n",
    "             'Top10perc',\n",
    "             'Top25perc',\n",
    "             'F_Undergrad',\n",
    "             'P_Undergrad',\n",
    "             'Outstate',\n",
    "             'Room_Board',\n",
    "             'Books',\n",
    "             'Personal',\n",
    "             'PhD',\n",
    "             'Terminal',\n",
    "             'S_F_Ratio',\n",
    "             'perc_alumni',\n",
    "             'Expend',\n",
    "             'Grad_Rate'],\n",
    "              outputCol=\"features\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "output = assembler.transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deal with Private column being \"yes\" or \"no\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import StringIndexer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "indexer = StringIndexer(inputCol=\"Private\", outputCol=\"PrivateIndex\")\n",
    "output_fixed = indexer.fit(output).transform(output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "final_data = output_fixed.select(\"features\",'PrivateIndex')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_data,test_data = final_data.randomSplit([0.7,0.3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Classifiers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.classification import DecisionTreeClassifier,GBTClassifier,RandomForestClassifier\n",
    "from pyspark.ml import Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create all three models:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Use mostly defaults to make this comparison \"fair\"\n",
    "\n",
    "dtc = DecisionTreeClassifier(labelCol='PrivateIndex',featuresCol='features')\n",
    "rfc = RandomForestClassifier(labelCol='PrivateIndex',featuresCol='features')\n",
    "gbt = GBTClassifier(labelCol='PrivateIndex',featuresCol='features')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train all three models:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train the models (its three models, so it might take some time)\n",
    "dtc_model = dtc.fit(train_data)\n",
    "rfc_model = rfc.fit(train_data)\n",
    "gbt_model = gbt.fit(train_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Comparison\n",
    "\n",
    "Let's compare each of these models!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dtc_predictions = dtc_model.transform(test_data)\n",
    "rfc_predictions = rfc_model.transform(test_data)\n",
    "gbt_predictions = gbt_model.transform(test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Evaluation Metrics:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.evaluation import MulticlassClassificationEvaluator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Select (prediction, true label) and compute test error\n",
    "acc_evaluator = MulticlassClassificationEvaluator(labelCol=\"PrivateIndex\", predictionCol=\"prediction\", metricName=\"accuracy\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dtc_acc = acc_evaluator.evaluate(dtc_predictions)\n",
    "rfc_acc = acc_evaluator.evaluate(rfc_predictions)\n",
    "gbt_acc = acc_evaluator.evaluate(gbt_predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Here are the results!\n",
      "--------------------------------------------------------------------------------\n",
      "A single decision tree had an accuracy of: 91.53%\n",
      "--------------------------------------------------------------------------------\n",
      "A random forest ensemble had an accuracy of: 91.53%\n",
      "--------------------------------------------------------------------------------\n",
      "A ensemble using GBT had an accuracy of: 91.95%\n"
     ]
    }
   ],
   "source": [
    "print(\"Here are the results!\")\n",
    "print('-'*80)\n",
    "print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))\n",
    "print('-'*80)\n",
    "print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))\n",
    "print('-'*80)\n",
    "print('A ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interesting! Optional Assignment - play around with the parameters of each of these models, can you squeeze some more accuracy out of them? Or is the data the limiting factor?"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
